ROUTER REPLAY IN ROLL
The ROLL framework supports Router Replay, a feature that addresses the training-inference mismatch caused by inconsistent expert routing in MoE (Mixture-of-Experts) RL training. By forcing the training-side MoE Router to use a pre-recorded set of routing decisions, Router Replay eliminates routing-level discrepancies at their source and substantially stabilizes training.
Note: ROLL currently implements only the R3 mode (Rollout Routing Replay: SGLang inference + Megatron training). R2 is not yet implemented, and combining R3 with sequence_packing is not yet supported either. Both will be added in future releases; until then, do not enable R3 together with sequence_packing.
1. Background
1.1 Routing Inconsistency in MoE RL
In each MoE layer the Router selects top-k experts per token. In RL training, the same set of weights is used by three different roles:
- Rollout policy: the policy used by the inference engine (e.g., SGLang) for sampling.
- Old policy: the training-side model state right before this batch's gradient updates.
- Training policy: the training-side model that is actively being updated.
Ideally, all three should produce identical routing, but in practice:
- Training vs. inference: the inference and training engines differ in kernel implementations, numerical precision, and parallelism layouts. Even with identical weights, the two sides may select different top-k experts for the same input.
- Across gradient steps: as mini-batch updates proceed, routing decisions also drift along with the weights.
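A minimal NumPy sketch of the first effect, using made-up logits: two experts are nearly tied, and a deviation of 3e-4 on one logit (standing in for kernel or precision differences) is enough to change the selected top-2 set. The helper name top_k_experts and all values are illustrative assumptions, not ROLL code.

```python
import numpy as np

def top_k_experts(logits: np.ndarray, k: int) -> np.ndarray:
    """Return the sorted indices of the k largest logits in each row."""
    idx = np.argsort(-logits, axis=-1, kind="stable")[..., :k]
    return np.sort(idx, axis=-1)

# One token, four experts; experts 1 and 2 are nearly tied on the training side.
train_logits = np.array([[2.0, 1.0001, 1.0000, -1.0]], dtype=np.float32)

# Hypothetical inference-side logits: same weights, tiny numerical deviation.
deviation = np.array([[0.0, -0.0003, 0.0, 0.0]], dtype=np.float32)
infer_logits = train_logits + deviation

train_choice = top_k_experts(train_logits, k=2)
infer_choice = top_k_experts(infer_logits, k=2)

print("training top-2: ", train_choice.tolist())
print("inference top-2:", infer_choice.tolist())
```

With these values the training side picks experts 0 and 1 while the inference side picks experts 0 and 2, even though the logits differ only in the fourth decimal place.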
1.2 Why It Matters
Routing is a discrete choice that gets amplified by the downstream expert outputs. When the rollout-side and training-side selected experts disagree, per-token output probabilities diverge significantly, which leads to:
- Inflated importance sampling ratios — many samples in PPO/GRPO become heavily clipped or contribute high-variance updates;
- Training collapse in highly off-policy regimes;
- Residual instability that IS correction or TIS-style loss-side compensation alone often cannot repair.
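To make the first point concrete, the sketch below computes per-token importance ratios and the share that falls outside a clip range of 1 ± 0.2. The log-probabilities are invented for illustration, and clip_fraction is a hypothetical helper; real PPO/GRPO clipping also depends on the sign of the advantage, which is ignored here.

```python
import numpy as np

def clip_fraction(logp_train, logp_rollout, eps: float = 0.2):
    """Return importance ratios and the fraction lying outside [1 - eps, 1 + eps]."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_rollout))
    outside = (ratio < 1.0 - eps) | (ratio > 1.0 + eps)
    return ratio, float(outside.mean())

# Illustrative per-token log-probabilities from the rollout engine.
logp_rollout = np.array([-1.0, -2.0, -0.5, -3.0])

# Case 1: routing agrees, so training-side log-probs differ only slightly.
aligned = logp_rollout + np.array([0.01, -0.02, 0.0, 0.01])
# Case 2: two tokens were routed to different experts and diverge strongly.
mismatched = logp_rollout + np.array([0.01, -0.9, 0.0, 0.6])

_, frac_aligned = clip_fraction(aligned, logp_rollout)
ratios, frac_mismatched = clip_fraction(mismatched, logp_rollout)

print("aligned clip fraction:   ", frac_aligned)
print("mismatched clip fraction:", frac_mismatched)
print("mismatched ratios:       ", np.round(ratios, 3).tolist())
```

In the mismatched case half of the tokens leave the clip range, which is the kind of signal loss and variance the list above describes.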
The idea behind Router Replay is simple: rather than trying to fix the discrepancy at the loss layer, fix the routing mask at the architecture layer so that the training side directly reuses a "reference" routing, removing this source of mismatch entirely.
2. Design
2.1 The Replay Formula
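Both R2 and R3 share the same mechanism: during the training forward pass, the top-k mask normally derived from the router logits is replaced with an externally provided mask $M$, and the gate weights are renormalized using the training-side logits $s$:

$$g_i = \frac{M_i \exp(s_i)}{\sum_{j=1}^{N} M_j \exp(s_j)}$$

Here $N$ is the number of experts, $M_i \in \{0, 1\}$ marks whether expert $i$ is selected, and $g_i$ is the resulting gate weight for expert $i$.
Key properties:
- The selection of experts is dictated by $M$, not by the training-side top-k.
- The softmax is still computed over training-side logits, so router weights still receive gradients normally.
R2 and R3 differ only in where $M$ comes from.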
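The replay rule described above can be sketched in NumPy as a softmax over the training-side logits restricted to externally supplied expert indices. The function name replay_gates and the array layout are assumptions made for this illustration; ROLL's actual implementation lives in the Megatron router and is not reproduced here.

```python
import numpy as np

def replay_gates(train_logits: np.ndarray, replay_idx: np.ndarray) -> np.ndarray:
    """Gate weights from training-side logits, restricted to replayed experts.

    train_logits: [tokens, num_experts] router logits computed during training.
    replay_idx:   [tokens, top_k] expert indices recorded elsewhere (integer dtype).
    """
    # Build the external selection mask from the recorded indices.
    mask = np.zeros_like(train_logits, dtype=bool)
    np.put_along_axis(mask, replay_idx, True, axis=-1)

    # Unselected experts get -inf so they receive exactly zero weight.
    masked = np.where(mask, train_logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Training-side top-2 would be experts {0, 1}; the replayed record says {1, 2}.
logits = np.array([[2.0, 1.0, 0.0, -1.0]])
recorded = np.array([[1, 2]])
gates = replay_gates(logits, recorded)
print(np.round(gates, 4).tolist())
```

The selected set follows the recorded indices, while the weights inside that set still depend on the training-side logits, which is why the router keeps receiving gradients.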
2.2 R2 — Vanilla Routing Replay (Not Yet Implemented)
- The replay mask comes from a forward pass that the training engine itself runs with the old-policy weights.
- For the first mini-batch, the training weights still equal the old-policy weights, so the replayed forward matches the original forward (effectively on-policy).
- For subsequent off-policy mini-batches, fixing the routing constrains policy staleness.
R2 addresses "routing drift across gradient steps within the training engine" but does not solve the discrepancy between inference and training engines. R2 is currently unimplemented in ROLL; selecting mode: R2 raises NotImplementedError.
2.3 R3 — Rollout Routing Replay (Supported in ROLL)
- The replay mask comes directly from the routing recorded by the inference engine during rollout.
- The training side uses an expert selection that is exactly aligned with the sampled trajectory, so the inference-vs-training routing gap is eliminated entirely.
- This also constrains routing drift across gradient steps (same benefit as R2).
R3 mitigates both the training-inference discrepancy and policy staleness simultaneously, and is the recommended path in ROLL.
2.4 End-to-End R3 Flow in ROLL
┌─────────────────────────────┐         ┌──────────────────────────────┐
│ SGLang Rollout              │         │ Megatron Training            │
│                             │         │                              │
│ generate(...)               │         │ forward()                    │
│  └─ MoE Router top-k        │         │  └─ MoE RouterReplay         │
│     └─ export indices       │ ──────► │     └─ replay indices        │
│        [seq, layers, k]     │  batch  │        in forward            │
│                             │  data   │                              │
│ return routed_experts       │         │ forward → backward           │
└─────────────────────────────┘         └──────────────────────────────┘
- Sampling: while generating tokens, SGLang also records the top-k experts chosen in each MoE layer, producing a routed_experts tensor of shape [seq_len, num_layers, top_k] that is returned with the response.
- Data movement: the rollout postprocessor attaches routed_experts to each sample in the batch; it then flows through ROLL's standard data path (DP / mini-batch / micro-batch) into the training worker.
- Training: before each forward, Megatron sets the per-layer RouterReplay indices from routed_experts; the MoE Router skips its top-k computation and replays the external indices; after the forward, the action flips to "replay-on-backward" so that activation recomputation uses the same routing.
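As a rough illustration of the training step, the sketch below takes a record with the [seq_len, num_layers, top_k] layout and splits it into one [seq_len, top_k] index array per MoE layer, which is the granularity a per-layer replay needs. The function split_per_layer and the random test record are hypothetical; how ROLL actually distributes the indices across ranks is not shown.

```python
from typing import List

import numpy as np

def split_per_layer(routed_experts: np.ndarray) -> List[np.ndarray]:
    """Split a [seq_len, num_layers, top_k] record into per-layer [seq_len, top_k] arrays."""
    if routed_experts.ndim != 3:
        raise ValueError(
            f"expected [seq_len, num_layers, top_k], got shape {routed_experts.shape}"
        )
    num_layers = routed_experts.shape[1]
    return [routed_experts[:, layer, :] for layer in range(num_layers)]

# Build a small fake record: top_k distinct experts per (token, layer).
seq_len, num_layers, top_k, num_experts = 5, 3, 2, 8
rng = np.random.default_rng(0)
record = np.stack([
    np.stack([rng.choice(num_experts, size=top_k, replace=False)
              for _ in range(num_layers)])
    for _ in range(seq_len)
]).astype(np.uint8)

per_layer = split_per_layer(record)
print(record.shape, len(per_layer), per_layer[0].shape)
```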
2.5 R3 + Sequence Packing Is Not Yet Supported
routed_experts is laid out per-sample on the original (unpacked) attention_mask, while sequence_packing concatenates sequences, repads them to a multiple of 2 × CP_SIZE × TP_SIZE, and re-chunks across CP ranks. The two layouts are not yet reconciled, so:
- ROLL currently disallows enabling R3 and sequence_packing simultaneously;
- When R3 is enabled, keep use_sequence_packing: False;
- A future release will add the index-remapping logic required to make the two work together.
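The padding arithmetic mentioned above can be sketched as follows. The sample lengths and parallel sizes are invented, round_up is a hypothetical helper, and whether the rounding is applied per sequence or to the concatenated total is not specified in this document, so the example only shows the rounding itself and how per-sample offsets shift after concatenation.

```python
def round_up(length: int, multiple: int) -> int:
    """Round length up to the nearest multiple (ceiling division)."""
    return -(-length // multiple) * multiple

cp_size, tp_size = 2, 2                  # illustrative parallel sizes
multiple = 2 * cp_size * tp_size         # 2 x CP_SIZE x TP_SIZE

sample_lengths = [5, 7, 3]               # three unpacked samples
packed_tokens = sum(sample_lengths)
padded_tokens = round_up(packed_tokens, multiple)

# Start offset of each sample inside the packed buffer.
offsets = [sum(sample_lengths[:i]) for i in range(len(sample_lengths))]

print("multiple:", multiple)
print("packed:", packed_tokens, "padded:", padded_tokens)
print("offsets:", offsets)
```

A per-sample routed_experts record indexed by the original token positions would need to be remapped to these packed offsets, and then to whatever slice each CP rank receives, before it could be replayed.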
2.6 Compatibility Matrix
| Feature | R3 Compatibility |
|---|---|
| Megatron training (megatron_train) | Supported |
| SGLang rollout (sglang) | Supported (required) |
| Tensor Parallelism (TP) | Supported |
| Pipeline Parallelism (PP) | Supported |
| Virtual Pipeline Parallelism (VPP) | Supported |
| Dynamic Batching | Supported |
| Sequence Packing | Not yet (planned) |
| GSPO | Orthogonal, can be combined |
| TIS / IS correction | Coexists; gains are workload-dependent |
| vLLM rollout | Not supported |
| FSDP / DeepSpeed training | Not supported |
3. Implementation
3.1 Rollout Side (SGLang)
In sglang_strategy.py, when router_replay.mode != "disable":
- The SGLang server is launched with enable_return_routed_experts=True;
- Each generate request sets return_routed_experts=True;
- The chunked outputs collect routed_experts from each chunk's meta_info and assemble per-sample records;
- SGLang >= 0.5.6.post3 is required.
3.2 Training Side (Megatron)
In megatron_strategy.py:
- During init, moe_enable_routing_replay is force-set to True so each MoE layer carries a RouterReplay instance;
- For compute_log_probs-style forwards: replay is enabled only when the batch contains routed_experts; otherwise (e.g., reference model) the global action is cleared and the router runs as usual;
- For train_step: the strategy asserts that routed_experts is present, sets REPLAY_FORWARD globally, runs forward/backward, and finally clears global state;
- In inner_forward_step, set_router_replay_data scatters the recorded indices to the current SP rank; after the forward, the action flips to REPLAY_BACKWARD so activation recomputation stays consistent.
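The control flow in the list above can be summarized as a small state sketch. Only the names REPLAY_FORWARD and REPLAY_BACKWARD come from this document; the enum, the two functions, and the use of None for "cleared" are assumptions made for illustration and do not mirror the real Megatron or ROLL classes.

```python
from enum import Enum
from typing import List, Optional

class ReplayAction(Enum):
    REPLAY_FORWARD = "replay_forward"
    REPLAY_BACKWARD = "replay_backward"

def log_prob_action(batch: dict) -> Optional[ReplayAction]:
    """compute_log_probs-style forward: replay only if the batch carries routing."""
    if "routed_experts" in batch:
        return ReplayAction.REPLAY_FORWARD
    return None  # e.g. reference model: router runs as usual

def train_step_actions(batch: dict) -> List[Optional[ReplayAction]]:
    """Sequence of global actions during one train step (None means cleared)."""
    if "routed_experts" not in batch:
        raise KeyError("routed_experts is required in train_step when R3 is enabled")
    return [
        ReplayAction.REPLAY_FORWARD,   # set before the forward pass
        ReplayAction.REPLAY_BACKWARD,  # flipped after forward, used by recomputation
        None,                          # cleared once forward/backward is done
    ]

print(log_prob_action({}))
print([a.name if a else None for a in train_step_actions({"routed_experts": []})])
```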
3.3 Core Utilities
roll/third_party/megatron/router_replay_utils.py provides:
- set_router_replay_data: scatter top-k indices to the local SP rank;
- RouterReplayHelper: query and toggle per-layer RouterReplay state across PP / VPP;
- get_routed_experts_dtype: pick uint8 / int16 automatically based on the number of experts to reduce memory and transfer cost.
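One plausible way to choose the index dtype is sketched below. This is not the body of get_routed_experts_dtype; the function name pick_index_dtype and the exact thresholds are assumptions, based only on the fact that uint8 can represent expert indices 0 to 255 and int16 can represent 0 to 32767.

```python
import numpy as np

def pick_index_dtype(num_experts: int) -> np.dtype:
    """Smallest of uint8 / int16 that can hold every expert index 0..num_experts-1."""
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    if num_experts - 1 <= np.iinfo(np.uint8).max:
        return np.dtype(np.uint8)
    if num_experts - 1 <= np.iinfo(np.int16).max:
        return np.dtype(np.int16)
    raise ValueError(f"{num_experts} experts do not fit in int16")

for n in (64, 256, 257):
    print(n, pick_index_dtype(n))
```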
4. Configuration
4.1 How to Enable
Set router_replay.mode: R3 on every worker that participates in R3. R3 must be enabled symmetrically on both the rollout and the training side — enabling it on only one side has no effect.
4.2 Parameters
router_replay.mode
- disable (default): Router Replay is off.
- R2: Vanilla Routing Replay. Not yet implemented; configuring this mode raises NotImplementedError.
- R3: Rollout Routing Replay, the only supported mode in ROLL today.
4.3 Full Configuration Example
actor_train:
  router_replay:
    mode: R3
  strategy_args:
    strategy_name: megatron_train
  # R3 + sequence_packing is not yet supported; keep this disabled
  use_sequence_packing: False
actor_infer:
  router_replay:
    mode: R3
  strategy_args:
    strategy_name: sglang # requires sglang >= 0.5.6.post3
reference:
  router_replay:
    mode: disable
  strategy_args:
    strategy_name: megatron_infer
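The constraints stated in this section can be checked before launching a run. The sketch below works on a plain dictionary mirroring the YAML above; check_r3_config is a hypothetical helper rather than part of ROLL, and it assumes that use_sequence_packing sits directly under each worker section.

```python
def router_replay_mode(config: dict, worker: str) -> str:
    """Read router_replay.mode for a worker, defaulting to 'disable'."""
    return config.get(worker, {}).get("router_replay", {}).get("mode", "disable")

def check_r3_config(config: dict) -> None:
    """Raise if the config violates the R3 rules described in this document."""
    train_mode = router_replay_mode(config, "actor_train")
    infer_mode = router_replay_mode(config, "actor_infer")
    if "R2" in (train_mode, infer_mode):
        raise NotImplementedError("R2 is not yet implemented")
    if train_mode != infer_mode:
        raise ValueError("router_replay.mode must match on actor_train and actor_infer")
    if train_mode != "R3":
        return  # Router Replay is off; nothing else to check
    required = {"actor_train": "megatron_train", "actor_infer": "sglang"}
    for worker, strategy in required.items():
        section = config.get(worker, {})
        actual = section.get("strategy_args", {}).get("strategy_name")
        if actual != strategy:
            raise ValueError(f"{worker} must use {strategy}, got {actual}")
        if section.get("use_sequence_packing", False):
            raise ValueError(f"{worker}: R3 cannot be combined with sequence packing")

config = {
    "actor_train": {"router_replay": {"mode": "R3"},
                    "strategy_args": {"strategy_name": "megatron_train"},
                    "use_sequence_packing": False},
    "actor_infer": {"router_replay": {"mode": "R3"},
                    "strategy_args": {"strategy_name": "sglang"}},
    "reference": {"router_replay": {"mode": "disable"},
                  "strategy_args": {"strategy_name": "megatron_infer"}},
}
check_r3_config(config)  # passes silently for the example above
```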
4.4 Usage Recommendations
- Environment & strategies: actor_infer must use sglang (≥ 0.5.6.post3); actor_train must use megatron_train.
- Symmetric enablement: configure mode: R3 on both rollout and training workers; enabling only one side is a no-op.
- Reference model: keep mode: disable. When routed_experts is missing from the batch, ROLL automatically skips the replay logic.
- Disable sequence packing for now: until R3 + sequence_packing is supported, keep use_sequence_packing: False on every worker that uses R3.
- Resource overhead: the routed_experts tensor ([seq_len, num_layers, top_k]) introduces extra memory and inter-worker transfer cost; ROLL automatically selects the smallest integer dtype to mitigate this.
- Relationship with IS / TIS: Router Replay fixes routing at the architecture level, while IS / TIS correct probability divergence at the loss level. They are complementary and can be used together depending on the workload.
- Troubleshooting: if training-inference mismatch persists, verify that (a) the rollout response actually carries routed_experts; (b) moe_enable_routing_replay is True on the training side; (c) sequence packing is disabled on every relevant worker.
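For the resource-overhead point above, a back-of-the-envelope estimate of the routed_experts payload per sample is the element count times the dtype width. The sequence length, layer count, and top-k below are invented for illustration and do not describe any particular model; routed_experts_nbytes is a hypothetical helper.

```python
import numpy as np

def routed_experts_nbytes(seq_len: int, num_layers: int, top_k: int, dtype) -> int:
    """Bytes needed for one [seq_len, num_layers, top_k] index record."""
    return seq_len * num_layers * top_k * np.dtype(dtype).itemsize

seq_len, num_layers, top_k = 8192, 48, 8   # illustrative values only

for dtype in (np.uint8, np.int16, np.int64):
    size = routed_experts_nbytes(seq_len, num_layers, top_k, dtype)
    print(f"{np.dtype(dtype).name:>6}: {size / 2**20:.1f} MiB per sample")
```

Under these assumed sizes a narrow dtype keeps the record at a few MiB per sample, whereas a default 64-bit integer would cost eight times as much as uint8.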
With Router Replay (R3) enabled, ROLL guarantees strict alignment of MoE routing between rollout and training under TP / PP / VPP parallelism, removing a class of mismatch that loss-level correction alone cannot fully address. Future releases will progressively add R2 and the R3 + sequence_packing combination.